ABSTRACT


The growing variety and volume of data available in the healthcare field have led many to adopt machine learning approaches to keep up with this influx. Data
pertaining to COVID-19 are an area of recent interest. The widespread impact of the virus across the United States creates an obvious need to identify groups of
individuals who are at increased risk of mortality from the virus. We propose a so-called clustered random forest approach to predict COVID-19 patient mortality.
We use this approach to examine the hidden heterogeneity of patient frailty by examining demographic information for COVID-19 patients. We find that our clustered
random forest approach attains predictive performance comparable to other published methods. We also find that follow-up analysis with decision tree algorithms and
linear regression provides insight into the type and magnitude of mortality risks associated with COVID-19.
I. INTRODUCTION:

Coronavirus disease 2019 (COVID-19) emerged in China in December 2019. As of
January 2021, over 95 million cases had been reported around the world, with a
mortality rate of 2% of the total closed cases [1]. This rapid pandemic expansion
represents a global concern and a serious threat to public health and the economy
worldwide. To prevent the infection from spreading, most countries restricted
social interaction through precautionary measures such as isolation and
quarantine. However, many infected patients did not receive proper treatment
because of late diagnosis and the novel and unknown nature of the virus. Recently,
many researchers have focused on developing new methodologies to screen infected
patients at different stages and to find notable associations between a patient's
clinical features and the chance of succumbing to the disease [2, 3]. Recent
studies have found that artificial intelligence (AI) and machine learning (ML)
techniques can play a key role in reducing the effect of the virus spread [4-6].
Applications of ML to patient data span a range of research directions [7]. One
of the most important is predicting the infection rate and mortality rate and
building a model to classify patients based on their clinical findings [8, 9].
These investigations are extremely important and would greatly assist people in
the health sector to be well prepared and to take all necessary precautions to
minimize the spread of the pandemic.


The aim of this research is to develop a prediction model to calculate the
severity of the disease in COVID-19 patients, using risk factors that can be
monitored remotely while the patient is at home. Moreover, the study explores the
impact of vital signs, chronic diseases, preliminary clinical investigations, and
demographic features on predicting the survival versus the mortality of COVID-19
patients. The study used COVID-19 patient data from the King Fahad University
Hospital, containing clinical findings and demographic information, to validate
the model's performance and effectiveness. All risk factors or vital signs that
can be measured through widely used sensors were included in the study, such as
blood oxygen level, temperature, pulse rate, and blood pressure. The model will
serve as an early warning system to identify at-risk patients in a timely manner.


II. RELATED WORK:

Early detection and diagnosis using AI techniques help to prevent the spread and 
to combat the COVID-19 pandemic using different data such as CT scans, X-ray, 
clinical data, and blood sample data. 


Yan et al. [10] predicted the criticality and survival chances of patients with
severe COVID-19 infection based on different risk factors and demographic
information. The dataset consists of 375 records from patients admitted to
Tongji Hospital from January 10th to February 18th, 2020, including 201
survivors and 174 deceased within the same period. They used an XGBoost (XGB)
model and identified only three main clinical features as significant, i.e.,
lactic dehydrogenase (LDH), lymphocyte, and high-sensitivity C-reactive protein
(Hs-CRP), selected from more than 300 features. The proposed model was validated
using data from 29 patients. The key finding of the research was the model's
ability to predict the risk of death with 0.95 precision and 0.90 prediction
accuracy. Such models will equip physicians with a tool for identifying critical
conditions, thereby helping to reduce the mortality rate. Even though these
findings are of great importance, the research has some limitations that affect
the accuracy of the reported results. These limitations are due to the small size
of the dataset, with only 29 patient records used for validation.


Similarly, Wong and So [11] used XGB with another dataset to predict severe
cases and deaths and to identify the risk factors associated with COVID-19. The
dataset was retrieved from the United Kingdom Biobank (UKBB) and includes 93
different variables collected between 16 March 2020 and 19 July 2020. Two
different studies were conducted based on the sample groups. For the first study,
the data were clinical prediagnostic data of 1747 COVID-19 infected patient
records containing both severe and death cases. For the severity class, the
accuracy achieved was 0.668, and for the fatality class, the accuracy was 0.712.
For the second study, the data were taken from the negative cases, the general
population with no COVID-19 infection, consisting of 489987 records. The same
model was applied, and the accuracy achieved was similar to the first study:
0.669 for the severity class and 0.749 for the fatality class. It is worth
mentioning that the researchers identified the five most significant risk factors
for severe cases and death cases, with age being the top factor for both. Other
factors include obesity, impaired renal function, multiple comorbidities, and
cardiometabolic abnormalities.


III. DATA SET:

In this paper, we used a dataset of more than 17365 laboratory-confirmed
COVID-19 patients from 146 countries around the world, including 307,382
labeled samples, containing both male and female patients with an average age of
44.75. The disease was confirmed by detection of virus nucleic acid. The original
dataset contained 32 data elements for each patient, including demographic and
physiological data. At the data cleaning stage, we removed irrelevant and
redundant data elements such as data source, admin id, and admin name. We also
removed the unlabeled data samples. Then, data imputation techniques, including
mean/median/mode value replacement and the KNN technique, were used to handle
missing values.
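
The two imputation strategies named above can be illustrated with a minimal
sketch. It is not the pipeline used in this study: the helper names
(impute_column, knn_impute), the toy values, and the choice of Euclidean distance
with k = 2 are assumptions made only for illustration.

```python
from statistics import mean, median, mode

def impute_column(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def knn_impute(rows, target, k=2):
    """Fill a missing rows[i][target] with the mean of that column over the
    k nearest complete rows. Distance is Euclidean over the other columns,
    which are assumed to be fully observed in this sketch."""
    others = [j for j in range(len(rows[0])) if j != target]
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        new_row = list(r)
        if r[target] is None:
            nearest = sorted(
                complete,
                key=lambda c: sum((c[j] - r[j]) ** 2 for j in others),
            )[:k]
            new_row[target] = mean(c[target] for c in nearest)
        filled.append(new_row)
    return filled

# Hypothetical toy data: ages, sex, and [temperature, pulse, oxygen level] rows.
ages = [30, None, 50, 70]
sex = ["male", "female", None, "male"]
vitals = [[36.5, 80, 98], [39.0, 110, None], [38.8, 105, 90], [39.2, 112, 88]]

print(impute_column(ages, "mean"))
print(impute_column(sex, "mode"))
print(knn_impute(vitals, target=2, k=2))
```

In practice the columns would usually be scaled before computing distances, so
that a feature with a large numeric range (such as pulse) does not dominate the
neighbor search.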


To obtain an accurate and unbiased model, we made sure that our dataset is
balanced. A balanced dataset with an equal number of observations for recovered
and deceased patients was created to train and test our model. The data samples
(patients) in the training dataset were selected randomly, and they are
completely separate from the testing data.
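
One common way to obtain such a balanced, disjoint split is to undersample the
larger class and then partition each class at random. The sketch below assumes
that approach; the text does not state which balancing method was used, and the
function name, the "outcome" field, and the 20% test fraction are hypothetical.

```python
import random

def balanced_split(records, label_key="outcome", test_fraction=0.2, seed=0):
    """Undersample every class to the size of the smallest one, then split each
    class into disjoint training and testing parts."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    n = min(len(group) for group in by_class.values())
    n_test = max(1, int(n * test_fraction))
    train, test = [], []
    for group in by_class.values():
        sample = rng.sample(group, n)      # random undersampling, no repeats
        test.extend(sample[:n_test])       # first part goes to testing
        train.extend(sample[n_test:])      # the rest goes to training
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical imbalanced data: 90 recovered and 10 deceased patients.
records = (
    [{"id": i, "outcome": "recovered"} for i in range(90)]
    + [{"id": 90 + i, "outcome": "deceased"} for i in range(10)]
)
train, test = balanced_split(records)
print(len(train), len(test))
```

Because each patient is drawn at most once and assigned to exactly one part, no
record can appear in both the training and the testing data.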


IV. DATA UNDERSTANDING AND PRE-PROCESSING: 

A. Data understanding: 

The purpose is to create a model to estimate house prices. We split the dataset
into features and a target variable. In this section, we aim to understand the
overall structure and original features of the dataset, and we perform
exploratory analysis to obtain useful observations [1]. The dataset contains
quite a few categorical variables that need to be converted to numeric form using
label encoding or dummy variables. Dummy variables are artificial placeholder
variables, created by the analyst, that stand in for the categories of a real
variable. The dataset also contains many null values and outliers, which need to
be handled accordingly. The bath, price, and balcony features are numeric
variables. The categorical variables are area_type, total_sqft, location,
society, availability, and size [1].

It can be seen that the price distribution is very spread out. The prices range
from 8 lakhs to 3600 lakhs, and most values are less than 500 lakhs.
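
The two conversions mentioned above, label encoding and dummy variables, can be
sketched as follows. The area_type values are hypothetical examples rather than
values taken from the dataset, and the helper names are illustrative.

```python
def label_encode(values):
    """Map each category to an integer code, in sorted category order."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Create one 0/1 dummy variable per category, in sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# Hypothetical values for the area_type column.
area_type = ["Plot Area", "Built-up Area", "Plot Area", "Carpet Area"]

codes, mapping = label_encode(area_type)
dummies, columns = one_hot_encode(area_type)
print(codes, mapping)
print(columns)
print(dummies)
```

Label encoding is compact but implies an ordering between categories; dummy
variables avoid that at the cost of one extra column per category.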


B. Data pre-processing: 

Preprocessing is one of the key steps in data analysis and prediction. Several
preprocessing techniques were applied to the dataset. The dataset contains data
for all the patients admitted to the hospital. Some symptoms or vital signs
occurred with very low frequency and were therefore removed from the dataset. All
symptoms with occurrences at 50% or above were selected to be added to the
feature set, while the symptoms with occurrences in the range from 2% to 49% were
accumulated as one feature that was assigned a unique code. The first three
symptoms, fever, cough, and shortness of breath (SOB), were defined as symptom
features, while the remaining symptoms were incorporated as a new attribute,
"sym_others." 5% of the patients in the study were asymptomatic at the time of
initial diagnosis and were considered part of the sym_others attribute.
Similarly, the top three chronic diseases with the highest frequency (i.e.,
diabetes, high blood pressure, and cardiac disease) were included as features,
and all other chronic disease types with more than 1 occurrence were
incorporated as one feature, "chr_others." After this initial preprocessing, an
encoding scheme was applied to the categorical features. As the dataset contains
a small number of missing values, imputation was performed using the decision
tree algorithm.
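
The frequency-based grouping of symptoms described above can be sketched as
follows. This is an illustration under stated assumptions: the function name and
toy patient lists are hypothetical, symptoms occurring in at least 50% of
patients keep their own column, symptoms between 2% and 50% are merged into
sym_others, and rarer symptoms are dropped. The special handling of asymptomatic
patients is not modeled here.

```python
def build_symptom_features(patients, keep_at=0.50, group_at=0.02):
    """Turn per-patient symptom lists into 0/1 features.

    Symptoms seen in at least `keep_at` of patients get their own column,
    symptoms seen in [group_at, keep_at) are merged into "sym_others",
    and anything rarer is ignored."""
    n = len(patients)
    counts = {}
    for symptoms in patients:
        for s in set(symptoms):                 # count each patient once
            counts[s] = counts.get(s, 0) + 1
    kept = sorted(s for s, c in counts.items() if c / n >= keep_at)
    grouped = {s for s, c in counts.items() if group_at <= c / n < keep_at}
    rows = []
    for symptoms in patients:
        row = {s: int(s in symptoms) for s in kept}
        row["sym_others"] = int(any(s in grouped for s in symptoms))
        rows.append(row)
    return kept, rows

# Hypothetical symptom lists for four patients.
patients = [
    ["fever", "cough"],
    ["fever", "headache"],
    ["fever", "cough", "nausea"],
    [],
]
kept, rows = build_symptom_features(patients)
print(kept)
for row in rows:
    print(row)
```

The same function could be applied to chronic diseases by changing the
thresholds and the name of the merged column.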


V. METHODOLOGY:

Fig. 2: Flowchart of Implementation


Following Fig. 2, the regression analysis and the decision tree algorithms are
described in the following subsections:


A. Basic Linear Regression Model: 

Linear regression is a supervised learning method. It predicts the value of a
dependent variable (Y) from a given independent variable (X); that is, it models
the relationship between the input (X) and the output (Y). It is one of the most
well-known and well-understood algorithms in machine learning [4]. A linear
regression line has an equation of the form Y = a + bX, where X is the
explanatory variable and Y is the dependent variable. The slope of the line is b,
and a is the intercept (the value of Y when X = 0).
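
As an illustration of the equation Y = a + bX, the sketch below estimates a and b
by ordinary least squares for a single explanatory variable. The text does not
say how the coefficients were fitted, so least squares is an assumption here, and
the function name and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X with one explanatory variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # intercept: the value of Y when X = 0
    return a, b

# Hypothetical points that lie exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
print(a + b * 5.0)             # prediction for X = 5
```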


B. Random Forest Algorithm:

Our method for predicting COVID-19 patient mortality in this project relies
heavily on the random forest (RF) classifier from Breiman. Consequently, a brief
description of this method is appropriate. The RF classifier is itself made up
of many decision trees. A decision tree classifier is built by successively
splitting the data at decision nodes according to feature values. The initial
decision node splits the data into two groups according to a cutoff value for
one of the data features. Each of these groups is then split again by a decision
node, and this process continues, building out the "branches" of the decision
tree. When the splitting stops, the last remaining groups, or "leaves," of the
decision tree provide the class designation for the individuals in each group.
The feature used at each decision node to split the data is typically chosen so
that the error at that step is minimized. The RF classifier uses an ensemble
"forest" of these decision trees to make its classifications. Each tree in the
RF ensemble is built using a bootstrapped random sample of the available data
and considers only a random selection of the available features when each
splitting node is made. To classify an observation with the RF model, each
decision tree in the ensemble "votes" for the class it predicts, and the
majority vote of the decision trees in the ensemble is the class that the RF
classifier predicts. This reliance on the majority vote to classify an
observation provides better performance than a single decision tree classifier.
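
The bootstrap, random feature selection, and majority vote described above can be
illustrated with a deliberately simplified sketch. Each "tree" here is a
single-split stump rather than a full decision tree, so this is not Breiman's
algorithm in full and not the model used in this study. The function names, the
toy [age, oxygen level] rows, and the hyperparameters are assumptions.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Choose the (feature, cutoff) split with the fewest misclassifications."""
    best = None
    for f in feature_ids:
        for cutoff in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= cutoff]
            right = [y for r, y in zip(rows, labels) if r[f] > cutoff]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, cutoff, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    f, cutoff, left_label, right_label = stump
    return left_label if row[f] <= cutoff else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample using a random subset of features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Every stump votes; the majority class is the forest's prediction."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [age, oxygen level] with the recorded outcome.
rows = [[25, 98], [32, 97], [41, 96], [48, 95],
        [67, 88], [72, 85], [80, 82], [85, 80]]
labels = ["recovered"] * 4 + ["deceased"] * 4
forest = fit_forest(rows, labels)
print(predict_forest(forest, [90, 78]))
print(predict_forest(forest, [20, 99]))
```

With deeper trees, the random feature subset would be redrawn at every splitting
node; with one split per tree, drawing it once per tree is equivalent.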


C. Decision Tree Algorithm: 

A decision tree algorithm trains a tree-structured model on existing data in
order to predict meaningful continuous output for future data. The main concepts
involved are the basic idea of a decision tree, maximizing information gain,
classification trees, and regression trees. The basic concept of a decision tree
is recursive partitioning: starting from the root node, known as the parent
node, each node can be split into child nodes, and these child nodes can in turn
become the parent nodes of further child nodes. To split the nodes at the most
informative features, the tree learning algorithm defines an objective function
and maximizes the information gain at each split.
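
Information gain for a single split can be computed as in the sketch below. It
assumes entropy as the impurity measure, which the text does not specify (Gini
impurity for classification or variance reduction for regression trees are
common alternatives). The function names and labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical parent node with four patients of each class.
parent = ["deceased"] * 4 + ["recovered"] * 4

# A split that separates the classes perfectly.
gain_perfect = information_gain(parent, ["deceased"] * 4, ["recovered"] * 4)

# A split that leaves both children as mixed as the parent.
mixed = ["deceased", "deceased", "recovered", "recovered"]
gain_useless = information_gain(parent, mixed, mixed)

print(gain_perfect, gain_useless)
```

A tree learner would evaluate this quantity for every candidate feature and
cutoff at a node and choose the split with the largest gain.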


VI. CONCLUSION: 

The system uses this data in the most efficient way. Linear regression algorithms
help satisfy customers by increasing the accuracy of real estate selection and
reducing the risk of real estate investments. Many features can be added to make
the system more widely accepted. One major direction for future work is to add
real estate databases for more cities, which would allow users to explore more
properties and make informed decisions. More factors that affect house prices,
such as a recession, should also be added [2]. Detailed information should be
added for every property so that users can inspect the desired real estate
sample in depth. This would help the system operate at a higher level.


In this paper, an overview of the concept of machine learning, along with its
various applications, is discussed [5]. Taking data samples for houses and
considering their various attributes, house prices were predicted from previous
data using machine learning regression methods, and the quality of the output
was checked. Data modeling and analysis of this kind have a range of future
applications in flat price prediction systems. Based on the results, it can be
concluded that machine-learning-based forecasting is understandable and
meaningful from a data analysis point of view. When done correctly, the accuracy
achieved can be high, and thus machine learning techniques find applications in
many fields.